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ABSTRACT 

A  desire  with  iterative  optimization  techniques  is  that  the  algorithm  reaches  the  global  optimum  rather 
than  get  stranded  at  a  local  optimum  value.  In  this  paper,  we  examine  the  theoretical  and  numerical 
global  convergence  properties  of  a  certain  “gradient  free”  stochastic  approximation  algorithm  called 
“SPSA,”  that  has  performed  well  in  complex  optimization  problems.  We  establish  two  theorems  on  the 
global  convergence  of  SPSA.  The  first  provides  conditions  under  which  SPSA  will  converge  in 
probability  to  a  global  optimum  using  the  well-known  method  of  injected  noise.  The  injected  noise 
prevents  the  algorithm  from  converging  prematurely  to  a  local  optimum  point.  In  the  second  theorem, 
we  show  that,  under  different  conditions,  “basic”  SPSA  without  injected  noise  can  achieve 
convergence  in  probability  to  a  global  optimum.  This  occurs  because  of  the  noise  ejfectively  (and 
automatically)  introduced  into  the  algorithm  by  the  special  form  of  the  SPSA  gradient  approximation. 
This  global  convergence  without  injected  noise  can  have  important  benefits  in  the  setup  (tuning)  and 
performance  (rate  of  convergence)  of  the  algorithm.  The  discussion  is  supported  by  numerical  studies 
showing  favorable  comparisons  of  SPSA  to  simulated  annealing  and  genetic  algorithms. 

KEYWORDS:  Stochastic  Optimization,  Global  Convergence,  Stochastic  Approximation, 
Simultaneous  Perturbation  Stochastic  Approximation  (SPSA),  Recursive  Annealing 

1.  INTRODUCTION 

A  problem  of  great  practical  importance  is  the  problem  of  stochastic  optimization,  which  may  be  stated 
as  the  problem  of  finding  a  minimum  point,  e*  e  Rp ,  of  a  real- valued  function  L( e) ,  called  the  “loss 
function,”  that  is  observed  in  the  presence  of  noise.  Many  approaches  have  been  devised  for  numerous 
applications  over  the  long  history  of  this  problem.  A  common  desire  in  many  applications  is  that  the 
algorithm  reaches  the  global  minimum  rather  than  get  stranded  at  a  local  minimum  value.  In  this  paper, 
we  consider  the  popular  stochastic  optimization  technique  of  stochastic  approximation  (SA),  in 
particular,  the  form  that  may  be  called  “gradient-free”  SA.  This  refers  to  the  case  where  the  gradient, 
g(0)  =  3L( e)/ae  ,  of  the  loss  function  is  not  readily  available  or  not  directly  measured  (even  with  noise). 
This  is  a  common  occurrence,  for  example,  in  complex  systems  where  the  exact  functional  relationship 
between  the  loss  function  value  and  the  parameters,  e  ,  is  not  known  and  the  loss  function  is  evaluated 
by  measurements  on  the  system  (or  by  other  means,  such  as  simulation).  In  such  cases,  one  uses 
instead  an  approximation  to  g(0)  (the  well-known  form  of  SA  called  the  Kiefer- Wolfowitz  type  is  an 
example). 
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The  usual  form  of  this  type  of  SA  recursion  is: 

®k+\  =  ~ ak Sk <9* )  >  ( 1 ) 

where  gk( e )  is  an  approximation  (at  the  kth  step  of  the  recursion)  of  the  gradient  g(6) ,  and  {ak }  is  a 
sequence  of  positive  scalars  that  decreases  to  zero  (in  the  standard  implementation)  and  satisfies  other 
properties.  This  form  of  SA  has  been  extensively  studied,  and  is  known  to  converge  to  a  local  minimum 
of  the  loss  function  under  various  conditions. 

Several  authors  (e.g.,  Chin  (1994),  Gelfand  and  Mitter  (1991),  Kushner  (1987),  and  Styblinski 
and  Tang  (1990))  have  examined  the  problem  of  global  optimization  using  various  forms  of  gradient- 
free  SA.  The  usual  version  of  this  algorithm  is  based  on  using  the  standard  “finite  difference”  gradient 
approximation  for  gk(Q).  It  is  known  that  carefully  injecting  noise  into  the  recursion  based  on  this 
standard  gradient  can  result  in  an  algorithm  that  converges  (in  some  sense)  to  the  global  minimum.  For  a 
discussion  of  the  conditions,  results,  and  proofs,  see,  e.g.,  Fang  et  al.  (1997),  Gelfand  and  Mitter 
(1991),  and  Kushner  (1987).  The  amplitude  of  the  injected  noise  is  decreased  over  time  (a  process 
called  “annealing”),  so  that  the  algorithm  can  finally  converge  when  it  reaches  the  neighborhood  of  the 
global  minimum  point. 

A  somewhat  different  version  of  SA  is  obtained  by  using  a  “simultaneous  perturbation”  gradient 
approximation,  as  described  in  Spall  (1992)  for  multivariable  ( p  >  t )  problems.  The  gradient 
approximation  in  simultaneous-perturbation  SA  (SPSA)  is  much  faster  to  compute  than  the  finite- 
difference  approximation  in  multivariable  problems.  More  significantly,  using  SPSA  often  results  in  a 
recursion  that  is  much  more  economical,  in  terms  of  loss-function  evaluations,  than  the  standard  version 
of  SA.  The  loss  function  evaluations  can  be  the  most  expensive  part  of  an  optimization,  especially  if 
computing  the  loss  function  requires  making  measurements  on  the  physical  system.  Several  studies 
(e.g.,  Spall  (1992),  Chin  (1997))  have  shown  SPSA  to  be  very  effective  in  complex  optimization 
problems.  A  considerable  body  of  theory  has  been  developed  for  SPSA  (Spall  (1992),  Chin  (1997), 
Dippon  and  Renz  (1997),  Spall  (2000),  and  the  references  therein),  but,  because  of  the  special  form  of 
its  gradient  approximation,  existing  theory  on  global  convergence  of  standard  SA  algorithms  is  not 
directly  applicable  to  SPSA.  In  Section  2  of  this  paper,  we  present  a  theorem  showing  that  SPSA  can 
achieve  global  convergence  (in  probability)  by  the  technique  of  injecting  noise.  The  “convergence  in 
probability”  results  of  our  Theorem  1  (Section  2)  and  Theorem  2  (Section  3)  are  standard  types  of 
global  convergence  results.  Several  authors  have  shown  or  discussed  global  convergence  in  probability 
or  in  distribution  (Chiang  et  al.  (1987),  Gelfand  and  Mitter  (1991),  Gelfand  and  Mitter  (1993),  Geman 
and  Geman  (1984),  Fang  et  al.  (1997),  Hajek  (1988),  Kushner  (1987),  Yakowitz  et  al.  (2000),  and 
Yin  (1999)).  Stronger  “almost  sure”  global  convergence  results  seem  only  to  be  available  by  using 
generally  infeasible  exhaustive  search  (Dippon  and  Fabian  (1994))  or  random  search  methods 
(Yakowitz  (1993)),  or  for  cases  of  optimization  in  a  discrete  (e  -)  space  (Alrefaei  and  Andradottir 
(1999)). 

The  method  of  injection  of  noise  into  the  recursions  has  proven  useful,  but  naturally  results  in  a 
relative  slowing  of  the  rate  of  convergence  of  the  algorithm  (e.g.,  Yin  (1999))  due  to  the  continued 
injection  of  noise  when  the  recursion  is  near  a  global  solution.  In  addition,  the  implementation  of  the 
extra  noise  terms  adds  to  the  complexity  of  setting  up  the  algorithm.  In  Section  3,  we  present  a  theorem 
showing  that,  under  different  (more  demanding)  conditions,  the  basic  version  of  SPSA  can  perform  as  a 


global  optimizer  without  the  need  for  injected  noise.  Section  4  contains  numerical  studies 
demonstrating  SPSA’s  performance  compared  to  two  other  popular  strategies  for  global  optimization, 
namely,  simulated  annealing  and  genetic  algorithms;  and  Section  5  is  a  summary.  The  Appendix 
provides  some  technical  details. 

2.  SPSA  WITH  INJECTED  NOISE  AS  A  GLOBAL  OPTIMIZER 

Our  first  theorem  applies  to  the  following  algorithm,  which  is  the  basic  SPSA  recursion  indicated  in 
equation  (1),  modified  by  the  addition  of  extra  noise  terms: 

eft+i  =0r -akik(®  ki+qk^k  >  (2) 

where  < okeRp  is  i.i.d.  N(0,i)  injected  noise,  ak=alk,  qk=qtk  logiogfc,  a>  o,  q  >  o .  and  §k(»)  is  the 
“simultaneous  perturbation”  gradient  defined  as  follows:  (3) 

gk( e  )  =  (2ckAk  r1  [L(0  +  ck  Ak )  -  L(0  -  ckAk )  +  e{+)  -e[,_)  ] , 
where  ck ,  e^±)  are  scalars,  A keRp  ,  and  the  inverse  of  a  vector  is  defined  to  be  the  vector  of  inverses. 
This  gradient  definition  follows  that  given  in  Spall  (1992).  The  ek  terms  represent  (unknown)  additive 
noise  that  may  contaminate  the  loss  function  observation,  ck  and  Akl  are  parameters  of  the  algorithm, 
the  c k  sequence  decreases  to  zero,  and  the  Akl  components  are  chosen  randomly  according  to  the 
conditions  in  Spall  (1992),  usually  (but  not  necessarily)  from  the  Bernoulli  ( ±l )  distribution.  (Uniformly 
or  normally  distributed  perturbations  are  not  allowed  by  the  regularity  conditions.) 

In  this  Section,  we  will  refer  to  Gelfand  and  Mitter  (1991)  as  GM91.  Our  theorem  on  global 
convergence  of  SPSA  using  injected  noise  is  based  on  a  result  in  GM91.  In  order  to  state  the  theorem, 
we  need  to  develop  some  notation,  starting  with  the  definition  of  a  key  probability  measure,  n11 ,  used  in 
hypothesis  H8  below.  Define  for  any  r|  >  o :  (0 )/dQ  =exp(-2L(0  )/ri2)/zT| ,  where 

z11  =  f  exp(-2L(0 )  /ri  2)rf0  .  Next,  define  an  important  constant,  c0 ,  for  convergence  theory  as  follows 

JRI 

(GM91).  For  t  e  R  and  e  Rp  ,  let 

h  t A>  l  .V  2 )  =  inf  2-  f  I  d§  ( 5) /  ds  +  g  (p  ( s))  I2  ds  , 

where  the  inf  is  taken  over  all  absolutely  continuous  functions  p  :  R  Rp  such  that  p(0)  =n|  and  pp)  =v2 , 
and  1*1  is  the  Euclidean  norm.  Let  v(u1,u2)  =  lim  i(t,vhv2) ,  and  s0  =  (0  I  g(0 )  =  0} .  Then 

t—>  oo 

C0=  |  sup  (V(VhV2)-2L(x>2)) . 

Vi,V2  eS0 

We  will  also  need  the  following  definition  of  tightness.  If  k  is  a  compact  subset  of  Rp  and  { xk  }  is  a 
sequence  of  random  p  -dimensional  vectors,  then  {  xk  }  is  tight  in  k  if  x()  e  K  and  for  any  e  >  o ,  there 
exists  a  compact  subset  k£  <zRp  such  that  P(Xke  Ke)>l-e,\/k>0  .  Finally,  let  l,k=gk^k)-g^k)  and  let 
superscript  prime  (')  denote  transpose. 

The  following  are  the  hypotheses  used  in  Theorem  1. 

HI. .  Let  Ak  e  Rp  be  a  vector  of  p  mutually  independent  mean-zero  random  variables 

{Akl,Ak2,...,Akp )'  such  that  \Ak }  is  a  mutually  independent  sequence  that  is  also  independent  of 


the  sequences  {0|,...,e^-i  {e1(±),...,e^)1  j ,  and  {co1,...,coyi._1},  and  such  that  Aki  is  symmetrically 

distributed  about  zero,  l  Aki  l<oq  <c»  a.s.  and  e  i  a~;2  i <  «2  <°°  ,  a.s.  v  kk  . 

H2..Let  £(,+)  and  e[_)  represent  random  measurement  noise  terms  that  satisfy  Ek  (e^+)-ef~))  =  o 
a.s.  Vk  ,  where  Ek  denotes  the  conditional  expectation  given  Zk=  the  sigma  algebra  induced  by 
{0~o,o)1,...,co£_1,^i*,...,£i'_i}  •  The  {e[±}}  sequences  are  not  assumed  independent.  Assume  that 
Ek[(£(k±))2]<a3  <  oo  a.s.  Vk  . 

H3.l(9)  is  a  thiice  continuously  differentiable  map  from  Rp  into  Rl ;  L( 9)  attains  the  minimum  value 
of  zero;  as  10  l->  °° ,  we  have  L(9)->°o  and  ig(9)M>°°;  inf(i  g(9)i2 -Lap(L(9)))>-<=°  (Lap  here  is 
the  Laplacian,  i.e.,  the  sum  of  the  second  derivatives  of  L(9 )  with  respect  to  each  of  its 
components);  l(3>  (9 )  =  a3L(9 )  /  39 ' 39  '39 '  exists  continuously  with  individual  elements  satisfying 

1 43/  ;  (0)l<a5<cc. 

H4.  The  algorithm  parameters  have  the  form  ak  =a/k,  ck  =  c/k'1 ,  for  k  =  1,2,... ,  where  a,o  0, 

q/a  >  C0  ,  and  y  e  [1/6, 1/2). 

H5 .  [(4p  -  4)  1(4 p  -  3)]1  ^ ' 2  <  lim  inf  (g  (0  )'0  /(I  g  (0 )  1 1 0  I))  . 

Bl— >°° 

H6.  Ek(UQk±ckAk))2  <a4<°o  a.s.  \/k  . 

H7.Let  a>k  beani.i.d.  N(0,n  sequence,  independent  of  the  sequences  {e^ ,...,efy _! } ,  {e1(±),...,e^)1} , 
and  } . 

H8.For  any  n  >  o.z11  <  oo;  jt1!  has  a  unique  weak  limit  n  as  n  ->o . 

H9.  There  exists  a  compact  subset  k  of  Rp  such  that  (e). )  is  tight  in  k. 

Comments: 

(a)  Assumptions  H3,  H5,  and  H8  correspond  to  assumptions  (Al)  through  (A3)  of  GM91; 
assumptions  H4  and  H9  supply  the  hypotheses  stated  in  GM91’s  Theorem  2;  and  the  definitions  of 
ak  and  qk  given  in  equation  (2)  correspond  to  those  used  in  GM91.  Since  we  will  show  that 
assumption  (A4)  of  GM91  is  satisfied  by  our  algorithm,  this  allows  us  to  use  the  conclusion  of  their 
Theorem  2. 

(b)  The  domain  of  y  given  in  H4  is  one  commonly  assumed  for  convergence  results  (e.g.,  Spall 
(1992)). 

We  can  now  state  our  first  theorem  as  follows: 

Theorem  1:  Under  hypotheses  HI  through  H9,  Qk  converges  in  probability  to  the  set  of  global  minima 
of  L(0)  . 

Proof:  See  Maryak  and  Chin  (1999),  and  the  remark  on  convergence  in  probability  in  GM91,  p. 

1003. 


3.  SPSA  WITHOUT  INJECTED  NOISE  AS  A  GLOBAL  OPTIMIZER 

As  indicated  in  the  introduction  above,  the  injection  of  noise  into  an  algorithm,  while  providing  for  global 
optimization,  introduces  some  difficulties  such  as  the  need  for  more  “tuning”  of  the  extra  terms  and 


retarded  convergence  in  the  vicinity  of  the  solution,  due  to  the  continued  addition  of  noise.  This  effect  on 
the  rate  of  convergence  of  an  algorithm  using  injected  noise  is  technically  subtle,  but  may  have  an 
important  influence  on  the  algorithm’s  performance.  In  particular,  Yin  (1999)  shows  that  an  algorithm  of 
the  form  (2)  converges  at  a  rate  proportional  to  ^log  log  (k  + const) ,  while  the  nominal  local  convergence 

rate  for  an  algorithm  without  injected  noise  is  k]n  ,  i.e.,  kin(Qk  -e*)  converges  in  distribution  (Spall 
(1992)).  These  rates  indicate  a  significant  difference  in  performance  between  the  two  algorithms. 

A  certain  characteristic  of  the  SPSA  gradient  approximation  led  us  to  question  whether  SPSA 
needed  to  use  injected  noise  for  global  convergence.  Although  this  gradient  approximation  tends  to 
work  very  well  in  an  SA  recursion,  the  SPSA  gradient,  evaluated  at  any  single  point  in  e  -space,  tends 
to  be  less  accurate  than  the  standard  finite-difference  gradient  approximation  evaluated  at  e  .  So,  one  is 
led  to  consider  whether  the  effective  noise  introduced  (automatically)  into  the  recursion  by  this 
inaccuracy  is  sufficient  to  provide  for  global  convergence  without  a  further  injection  of  additive  noise.  It 
turns  out  that  basic  SPSA  (i.e.,  without  injected  noise)  does  indeed  achieve  the  same  type  of  global 
convergence  as  in  Theorem  1,  but  under  a  different,  and  more  difficult  to  check,  set  of  conditions. 

In  this  Section,  we  designate  Kushner  (1987)  as  K87,  and  Kushner  and  Yin  (1997)  as  KY97. 
Here  we  are  working  with  the  basic  SPSA  algorithm  having  the  same  form  as  equation  (1): 

0A+1  =8*  ~ak8k(Qk)  ■>  (4) 

where  §kc)  is  the  simultaneous-perturbation  approximate  gradient  defined  in  Section  2,  and  now 
(obviously)  no  extra  noise  is  injected  into  the  algorithm.  For  use  in  the  subsequent  discussion,  it  will  be 
convenient  to  define 

bk(hk)  =  E(h(Qk'>~  *k)  ,  and  ek(Qk)=  gk(Qk)-E(gk(Qk) \Xk  ), 

where  xk  denotes  the  o  -algebra  generated  by  {e^e^.-.e*} ,  which  allows  us  to  write  equation  (4)  as 

ei:+l  =  0/t  ~ak\g®k  )+ek(Qk)+bk(®k)]  ■  (5) 

Another  key  element  in  the  subsequent  discussion  is  the  ordinary  differential  equation  (ODE): 

0=s(0),  (6) 

which,  in  Lemma  1  of  the  Appendix  is  shown  to  be  the  “limit  mean”  ODE  for  algorithm  (4). 

Now  we  can  state  our  assumptions  for  Theorem  2,  as  follows: 

J.l  .  Let  A keRp  be  a  vector  of  p  mutually  independent  mean-zero  random  variables 

{Akl,Ak2,...,Akp }'  such  that  {Ak }  is  a  mutually  independent  sequence  and  &k  is  independent  of 
the  sequence  {ej  ,...,0^-1 } ,  and  such  that  Aki  is  v  i,k  symmetrically  distributed  about  zero, 

\Aki  l<oc!  <°o  a.s.  and£l  A~^ \<a2  <°° . 

J.2  Let  e{+)  and  e(~)  represent  random  measurement  noise  terms  that  satisfy  £((e^+) -e^_))l  %)  =  o 
a.s.  Vk  .  The  (e^ }  sequences  need  not  be  assumed  independent.  Assume  that 

^((e®)2  ltti)<a3  <°o  a.s.  Vk  . 

J.3  (a).  L(9 )  is  thrice  continuously  differentiable  and  the  individual  elements  of  the  third  derivative 
satisfy  izj2^  (0)l<a5  <  « . 

(b).  |l(9)|^oo  as  |0 1  — > 00 . 


J.4  The  algorithm  parameters  satisfy  the  following:  the  gains  ak>  o ,  ak  ->o  as  £  ,  and 

Y,k=xak  =°°  •  The  sequence  \ck)  is  of  form  ck  =  dk! ,  where  c  >0  and  y  e  [1/6, 1/2) ,  and 

LOO  2 

k=0(ak  /ck)  <°°- 

J.5  The  gradient  g(9)  is  bounded  and  Lipschitz  continuous. 

J.6  The  ODE  (6)  has  a  unique  solution  for  each  initial  condition. 

J.7  For  the  ODE  (6),  suppose  that  there  exists  a  finite  collection  of  disjoint  compact  stable  invariant 
sets  (see  K87)  KhK2t...,Km  ,  such  that  fy|  .K,  contains  all  the  limit  sets  for  (6).  These  sets  are 

interpreted  as  closed  sets  containing  all  local  (including  global)  minima  of  the  loss  function. 

J.8  For  any  r|  >  o.z11  <  oo;  iuH  has  a  unique  weak  limit  n  as  n  ->  o  (  zn  and  ji11  are  defined  in  Section 
2). 

J.9  E\Y?i=1ei(6i)\<ooVk  . 

J.  10  For  any  asymptotically  stable  (in  the  sense  of  Fiapunov)  point,  e~ ,  of  the  ODE  (6),  there  exists 
a  neighborhood  of  the  origin  in  Rp  such  that  the  closure,  q2  ,  of  that  neighborhood  satisfies 
T  +  g2  =  (T  +  v :  y  e  o2 1  c  0 ,  where  e  c  r  p  denotes  the  allowable  e  -region.  There  is  a 
neighborhood,  Qx ,  of  the  origin  in  Rp  and  a  real- valued  function  continuous  in 

Ql  x  q2  ,  whose  xiq  -derivative  is  continuous  on  Qx  for  each  fixed  y2  e  Q2 ,  and  such  that  the 
following  limit  holds.  For  any  x .  A  >  o ,  with  %  being  an  integral  multiple  of  A ,  and  any  functions 
V1(,).V2(*)  taking  values  in  Q\  xq2  and  being  constant  on  the  intervals  [/A,iA+A),/A<  %  ,  we  have 

J0  Hl(\\fl(s)ty2(s))ds  =  hm  sup—  log£exp|£.=0  \|/j(iA)£  bn+j{ 6n+.)J.  (7) 

m,n  Wl 

Also,  there  is  a  function  H2(y  3 )  that  is  continuous  and  differentiable  in  a  small  neighborhood  of 
the  origin,  and  such  that 

f  //2  (y  i  (s))ds  =  lim  sup—  log E exp  en+j(^n+ j)  •  (8) 

mfi  ,n 

A  bit  more  notation  is  needed.  Fet  r  >  o  be  interpreted  such  that  |0,/’|  is  the  total  time  period 
under  consideration  in  ODE  (6).  Fet 

i(p,^2)=suP[V;(p-gfyf2))-^l,v2)’  ^ 

Vi 

and,  for  (|>(0)  =  jcg  r1  ,  define  the  function 

S(T$)  = 

if  ()>(•)  is  a  real- valued  absolutely-continuous  function  on  | o, r  \  and  to  take  the  value  oo 
otherwise.  S(T,§)  is  the  usual  action  functional  of  the  theory  of  large  deviations  (adapted  to  our 
context).  Define  tn  a‘  ’  an<^  ‘l  =  Y^2oan+i  ■  Define  {fy! }  and  0"(«)  by 

eg  =.t60,  0g+1  =^-an+kgn+k(Qp,  and  en(t)=e"  for  te[t”,t”+1). 

Now  we  can  state  the  last  two  assumptions  for  Theorem  2: 


J.  1 1  For  each  8  >  o  and  i  =  1,2,...,  m ,  there  is  a  p  -neighborhood  of  K, ,  denoted  n„  (K,  ) ,  and 
8  p  >  o,  Tp  <  oo  such  that,  for  each  x,yeNp  ( K, ) ,  there  is  a  path,  <)>(•) ,  with  ofOj  =  x,  <( >(TV)  =  y , 
where  Ty  < 7p  and  S(Tp,ty)<8  . 

J.12  There  is  a  sphere,  D{ ,  such  that  /),  contains  |J  w,  in  its  interior,  and  the  trajectories  of  e  ”(•) 
stay  in  Dx .  All  paths  of  ODE  (6)  starting  in  D\  stay  in  i)\ . 

Note  1.  Assumptions  Jl,  J2,  and  J3(a)  are  from  Spall  (1992),  and  are  used  here  to  characterize  the 
noise  terms  bk(Qk)  and  ek(Qk).  Assumption  J3(b)  is  used  on  page  178ofK87.  Assumption  J4 
expresses  standard  conditions  on  the  algorithm  parameters  (see  Spall  (1992)),  and  implies 
hypothesis  (A10.2)  in  KY97,  p.  174.  Assumptions  J5  and  J6  correspond  to  hypothesis 
(A10.1)  in  KY97,  p.  174.  Assumption  J7  is  from  K87,  p.  175.  Assumption  J8  concerns  the 
limiting  distribution  of  efe .  Assumption  J9  is  used  to  establish  the  “mean”  criterion  for  the 
martingale  sequence  in  Lemma  2.  Assumptions  Jl  1  and  J12  are  the  “controllability”  hypothesis 
A4.1  and  the  hypothesis  A4.2,  respectively,  of  K87,  p.  176. 

Note  2.  Assumption  J10  corresponds  to  hypotheses  (A10.5)  and  (A10.6)  in  KY97,  pp.  179-181. 
Although  these  hypotheses  are  standard  forms  for  this  type  of  large  deviation  analysis,  it  is 
important  to  justify  their  reasonableness.  The  first  part  (equation  (7),  involving  noise  terms 
bk (Q k ) )  of  J10  is  justified  by  the  discussion  in  KY97,  p.  174,  which  notes  that  the  results  of  their 
subsection  6.10  are  valid  if  the  noise  terms  (that  they  denote  )  are  bounded.  This  discussion 
is  applicable  to  our  algorithm  since  the  bk(Qk)  noise  terms  were  shown  by  Spall  (1992)  to  be 
CHc7-  )  ( ck  ->  0)  a.s.  The  second  part  (equation  (8),  involving  noise  terms  ek(Qk) )  is  justified  by 

the  discussion  in  KY97,  p.  174,  which  notes  that  the  results  in  their  subsection  6.10  are  valid  if 
the  noise  terms  they  denote  8 Mn  (corresponding  to  our  noise  terms  ek  <Qk ) )  satisfy  the 
martingale  difference  property  that  we  have  established  in  Lemma  2  of  the  Appendix. 

Now  we  can  state  our  main  theorem: 

Theorem  2.  Under  assumptions  Jl  through  J12,  e*  converges  in  probability  to  the  set  of  global  minima 

Of  L(0)  . 

The  idea  of  the  proof  is  as  follows  (see  the  Appendix  for  the  details).  This  theorem  follows  from  results 
(in  a  different  context)  in  K87  for  an  algorithm  Qk+l  =Qk-ak[g(Qk)+C,k] ,  where  C,k  is  i.i.d.  Gaussian 
(injected)  noise.  In  order  to  prove  our  Theorem  2,  we  start  by  writing  the  SPSA  recursion  as 
0/.+I  =  Qk -ak[g(Qk)+^k] ,  where  G*  =  h(®k)-g®k)  is  the  “effective  noise”  introduced  by  the  inaccuracy 
of  the  SPSA  gradient  approximation.  So,  our  algorithm  has  the  same  form  as  that  in  K87.  However, 
since  Ck  is  not  i-i-d.  Gaussian,  we  cannot  use  K87’s  result  directly.  Instead,  we  use  material  in  Kushner 
and  Yin  (1997)  to  establish  a  key  “large  deviation”  result  related  to  our  algorithm  (4),  which  allows  the 
result  in  K87  to  be  used  with  C  replacing  the  t,k  in  his  algorithm. 


4.  NUMERICAL  STUDIES:  SPSA  WITHOUT  INJECTED  NOISE 


4.1.  Two-Dimensional  Problem 


A  study  was  done  to  compare  the  performance  of  SPSA  to  a  recently  published  application  of  the 
popular  genetic  algorithm  (GA).  The  loss  function  is  the  well-known  Griewank  function  (see  Haataja 
(1999))  defined  for  a  two-dimensional  e  =  (ti,t2Y ,  by: 

L(0 )  =  cos(t,  -  100)cos[(t2  - 100) /-Jl]  -  [(/j  - 100) 2  +  (t2  - 100)2  ]  /4000  - 1 , 
which  has  thousands  of  local  minima  in  the  vicinity  of  a  single  global  minimum  at  e  =  (loo.ioo)'  at  which 
L(e )  =  o .  Haataja  (1999)  describes  the  application  of  a  GA  to  this  function  (actually,  to  find  the 
maximum  of  -u e ))  based  on  noise-free  evaluations  of  L(@)  (i.e.,  ek  =  o).  This  study  achieved  a 
success  rate  of  66%  (see  Haataja’s  Table  1.3,  p.16)  in  50  independent  trials  of  the  GA,  using  300 
generations  and  9000  L(0)  evaluations  in  each  run  of  the  GA.  Haataja’s  definition  of  a  successful 
solution  is  a  reported  solution  where  the  norm  of  the  solution  minus  the  correct  value,  0  * ,  is  less  than 
0.2,  and  the  value  of  the  loss  function  at  the  reported  solution  is  within  0.01  of  the  correct  value  of  zero. 
We  examined  the  performance  of  basic  SPSA  (without  adding  injected  noise)  on  this  problem,  using 
ak  =a/(A  +  k)a ,  with  A  =  60,  a  =  loo  and  a  =  .602 ,  a  slowly  decreasing  gain  sequence  of  a  form  that  has 
been  used  in  many  applications  (see  Spall  (1998)).  For  the  gradient  approximation  (equation  (3)),  we 
chose  each  component  of  Ak  to  be  an  independent  sample  from  a  Bernoulli  (±i)  distribution,  and 

ck  =c/k'i ,  with  c  =  10  and  y  =  .101 .  Since  we  used  the  exact  loss  function,  the  ek  noise  terms  were  zero. 
We  ran  SPSA,  allowing  3000  function  evaluations  in  each  of  50  runs,  and  starting  the  algorithm  (each 
time)  at  a  point  randomly  chosen  in  the  domain  [-200, 400]x[-200,400] .  Haataja’s  0  -domain  was  also 
constrained  to  he  in  a  box,  but  the  dimensions  of  the  box  were  not  specified.  Hence  we  chose  a  domain 
that  is  a  cube  centered  at  the  global  minimum,  in  which  there  are  many  local  minima  of  w )  (as  seen  in 
Haataja’s  (1999)  Figure  1.1).  SPSA  successfully  located  the  global  minimum  in  ah  50  mns  (100% 
success  rate). 

4.2.  Ten-Dimensional  Problem 

For  a  more  ambitious  test  of  the  global  performance  of  SPSA,  we  applied  SPSA  to  a  loss  function 
given  in  Example  6  of  Styblinski  and  Tang  (1990),  which  we  will  designate  for  convenience  as  ST90. 
Hie  loss  function  is: 

p  p 

L(0 )  =  (2 p)  -4Pncos(fi) , 

i—l  i—\ 

where  p  =  to  and  0  =  (t\ ,..., tp )' .  This  function  has  the  global  minimum  value  of  -40  at  the  origin,  and  a 

large  number  of  local  minima.  As  in  the  two-dimensional  study  above,  we  used  the  exact  loss  function. 
Our  goal  is  to  compare  the  performance  of  SPSA  without  injected  noise  with  simulated  annealing  and 
with  a  GA. 

For  the  simulated  annealing  algorithm,  we  use  the  results  reported  in  ST90.  They  used  an 
advanced  form  of  simulated  annealing  called  fast  simulated  annealing  (FSA).  According  to  ST90,  FSA 
has  proven  to  be  much  more  efficient  than  classical  simulated  annealing  due  to  using  Cauchy  (rather  than 
Gaussian)  sampling  and  using  a  fast  (inversely  linear  in  time)  cooling  scheme.  For  more  details  on  FSA, 


see  ST90.  The  results  of  their  application  of  FSA  to  the  above  L(e )  are  given  in  Table  1  below  (FSA 
values  taken  from  Table  10  of  ST90).  Table  1  shows  the  results  of  10  independent  runs  of  each 
algorithm.  In  each  case  (each  run  of  each  algorithm),  the  best  value  of  w )  found  by  the  algorithm  is 
shown.  In  their  study,  although  FSA  was  allowed  to  use  50,000  function  evaluations  for  each  of  the 
mns,  the  algorithm  showed  very  limited  success  in  locating  the  global  minimum.  It  should  be  noted  that 
the  main  purpose  of  the  ST90  paper  was  to  examine  a  relatively  new  algorithm,  stochastic 
approximation  combined  with  convolution  smoothing.  This  algorithm,  which  they  call  SAS,  was  much 
more  effective  than  FSA,  yielding  results  between  those  shown  in  Table  1  for  GA  and  SPSA. 

For  the  genetic  algorithm  (GA),  we  implemented  a  GA  using  the  popular  features  of  elitism  (elite 
members  of  the  old  population  pass  unchanged  into  the  new  population),  tournament  selection 
(tournament  size  =  2),  and  real-number  encoding  (see  Mitchell  (1996),  pp.  168,  170,  and  157, 
respectively).  After  considerable  experimentation,  we  found  the  following  settings  for  the  GA  algorithm 
to  provide  the  best  performance  on  this  problem.  The  population  size  was  100,  the  number  of  elite 
members  (those  carried  forward  unchanged)  in  each  generation  was  10,  the  crossover  rate  was  0.8, 
and  mutation  was  accomplished  by  adding  a  Gaussian  random  variable  with  mean  zero  and  standard 
deviation  0.01  to  each  component  of  the  offspring.  The  original  population  of  100  (10-dimensional)  e  - 
vectors  was  created  by  uniformly  randomly  generating  points  in  the  10-dimensional  hypercube  centered 
at  the  origin,  with  edges  of  length  6  (so,  all  components  had  absolute  value  less  than  or  equal  to  3 
radians).  We  constrained  all  component  values  in  subsequent  generations  to  be  less  than  or  equal  to  4.5 
in  absolute  value.  This  worked  a  bit  better  than  constraining  them  to  be  less  than  3,  since,  with  the 
tighter  constraints,  the  GA  got  stuck  at  the  constraint  boundary  and  could  not  reach  local  minima  that 
were  just  over  the  boundary.  All  mns  of  the  GA  algorithm  reported  here  used  50,000  evaluations  of  the 
loss  function.  Hie  results  of  the  10  independent  mns  of  GA  are  shown  in  Table  1.  Although  the 
algorithm  did  reasonably  well  in  getting  close  to  the  minimum  loss  value  of  -40,  it  only  found  the  global 
minimum  in  one  of  the  10  mns  (mn  #8).  In  the  other  nine  cases,  a  few  (typically  two  or  four)  of  the 
components  were  trapped  in  a  local  minimum  (around  ± pi  radians),  while  the  rest  of  the  components 
(approximately)  achieved  the  correct  value  of  zero.  Note  that  the  nature  of  the  loss  function  is  such  that 
the  value  of  L(9)  is  very  close  to  an  integer  (e.g.,  -39.0  or  -38.0)  when  an  even  number  (e.g.,  2  or  4) 
of  components  of  e  are  near  ±pi  radians. 

We  examined  the  performance  of  basic  SPSA  (without  adding  injected  noise),  using  the 
algorithm  parameters  defined  in  Subsection  4.1  with  A  =  60,  a  =  l,  a  =  .602,  c  =  2,and  y  = .  l  o  i .  We 
started  e  at  /,  =  3  radians,  i  =  t ,  resulting  in  an  initial  loss  function  value  of  -31 .  This  choice  of 
starting  point  was  at  the  outer  boundary  of  the  domain  in  which  we  chose  initial  values  for  the  GA 
algorithm,  and  we  did  not  constrain  the  search  space  for  SPSA  as  we  did  for  GA  (the  initialization  and 
search  space  for  FSA  were  not  reported  in  ST90).  We  ran  10  Monte  Carlo  trials  (i.e.,  randomly 
varying  the  choices  of  Ak  ).  The  results  are  tabulated  in  Table  1.  The  results  of  these  numerical  studies 
show  a  strong  performance  of  the  basic  SPSA  algorithm  in  difficult  global  optimization  problems 


Table  1.  Best  Loss  Function  Value  in  Each  of  10  Independent  Runs  of  Three  Algorithms 


Run 

SPSA 

GA 

FSA 

1 

-40.0 

-38.0 

-24.9 

2 

-40.0 

-39.0 

-15.5 

3 

-40.0 

-39.0 

-29.0 

4 

-40.0 

-38.0 

-32.1 

5 

-40.0 

-37.0 

-30.2 

6 

-40.0 

-39.0 

-30.1 

7 

-40.0 

-38.0 

-27.9 

8 

-40.0 

-40.0 

-20.9 

9 

-40.0 

-38.0 

-28.5 

10 

-40.0 

-39.0 

-34.6 

Average  Value 

-40.0 

-38.5 

-27.4 

Number  of  Function  Evaluations 

2,500 

50,000 

50,000 

5.  SUMMARY 

SPSA  is  an  efficient  gradient-free  SA  algorithm  that  has  performed  well  on  a  variety  of  complex 
optimization  problems.  We  showed  in  Section  2  that,  as  with  some  standard  SA  algorithms,  adding 
injected  noise  to  the  basic  SPSA  algorithm  can  result  in  a  global  optimizer.  More  significantly,  in 
Section  3,  we  showed  that,  under  certain  conditions,  the  basic  SPSA  recursion  can  achieve  global 
convergence  without  the  need  for  injected  noise.  The  use  of  basic  SPSA  as  a  global  optimizer  can 
ease  the  implementation  of  the  global  optimizer  (no  need  to  tune  the  injected  noise)  and  result  in  a 
significantly  faster  rate  of  convergence  (no  extra  noise  corrupting  the  algorithm  in  the  vicinity  of  the 
solution).  In  the  numerical  studies,  we  found  significantly  better  performance  of  SPSA  as  a  global 
optimizer  than  for  the  popular  simulated  annealing  and  genetic  algorithm  methods,  which  are  often 
recommended  for  global  optimization.  In  particular,  in  the  case  of  a  10-dimensional  optimization 
parameter  (e  ),  the  fast  simulated  annealing  and  genetic  algorithms  generally  failed  to  find  the  global 
solution. 

APPENDIX  (LEMMAS  RELATED  TO  THEOREM  2  AND  PROOF  OF  THEOREM  2) 

In  this  Appendix,  we  designate  Kushner  (1987)  as  K87,  and  Kushner  and  Yin  (1997)  as  KY97.  Here 
we  are  working  with  the  basic  SPSA  algorithm  as  defined  in  equation  (4): 

®fc+l  =  ®k  ~ak8k(Qk)  ■ 

We  first  establish  an  important  prehminary  result  that  is  needed  in  order  to  apply  the  results  from  K87 
and  KY97  in  the  proof  of  Theorem  2. 

Lemma  1.  The  ordinary  differential  equation  (eq.  (6)  above), 

6  =  g( 0), 

is  the  “limit  mean  ODE”  for  algorithm  (4). 

Proof:  Examining  the  definition  of  limit  mean  ODE  given  in  KY97,  pp.  174  &  138,  it  is  clear  that  we 
need  to  prove  that  -L  [g(e ) +  ek  (Qk )  +  bk  (Qk  >  ]  g(6 )  w.p.  1  as  m,n  ->  oo .  Since  Spall 


(1992)  has  shown  that  bk %)  ->  0  w.p.  1,  we  can  conclude  using  Cesaro  summability  that  the 
contribution  of  the  bk(Qk)  terms  to  the  limit  is  zero  w.p.  1.  For  the  ek(6k)  terms,  we  have  by 
definition  that  E[ek  (Qk )]  =  0 ;  hence,  by  the  law  of  large  numbers,  the  contribution  of  the  e k  (0  k ) 
terms  to  the  limit  is  also  zero.  Q.E.D. 

Our  next  Lemma  relates  to  Note  2  in  Section  3. 

Lemma  2.  Under  assumptions  Jl,  J3(a),  and  J9,  the  sequence  {ek(6k )}  is  a  -martingale  difference. 
Proof:  It  is  sufficient  to  show  that  Mk  =  ek  <ek )  is  a  sk  -martingale.  Assumption  J9  satisfies  the  first 
requirement  (see  KY97,  p.68)  of  the  martingale  definition,  that  E\Mk |  <  °° .  For  the  main 
requirement,  we  have  for  any  k  : 

=  Mk  +E[ek+i(Qk+i)\Mk,...,Mi\ 

=  Mk+E{ [gM (0 ,+1 )  -  E(gk+l (e,+1 )  1 0M )]  I  Mk } 

=  Mk+E^E{[gk+l{Qk+1)-E{gk+l(Qk+l)  10**)]  I  Mjk+l] 

=  Mk  +E6  E(gMQM)\Mk,QM)-Ef  E[E(gk+1(Qk+l)\Mk,Qk+l]  =  Mk, 

V/c+1  °k+l 

where  denotes  expectation  conditional  on  efc+1 ,  and  all  equahties  concerning  conditional 
expectations  are  w.p.  1.  Q.E.D. 

A  key  step  in  the  proof  of  our  main  result  (Theorem  2  below)  is  establishing  the  following  “large 
deviation”  result  (Lemma  3).  Let  Bx  be  a  set  of  continuous  functions  on  [0,7’]  taking  values  in  ©  and 

with  initial  value  x  .  Let  Bx  denote  the  interior  of  Bx ,  and  bx  denote  the  closure. 

Lemma  3.  Under  assumptions  J4,  J5,  J6,  and  J 10,  we  have 

-  inf  5  ( 7\<)> )  <  lim  inf  log  P"  {0  "  (•)  e  Bx } 

<1 >efl$ 

<  lim  suplogP”{0  ”(•)£  Bx  j  <  -  inf  S(7\(|) ) ,  (9) 

n  0-  Bx 

where  px  denotes  the  probabihty  under  the  condition  that  0"(O)  =  x. 

Proof:  This  result  is  adapted  from  Theorem  10.4  in  KY97,  p.  181.  Note  that  our  assumption  J10  is  a 
modified  form  of  their  assumptions  (A10.5)  and  (A10.6),  using  “equals”  signs  rather  than 
inequalities.  The  two-sided  inequality  in  (9)  follows  from  J10  by  an  argument  analogous  to  the 
proof  of  KY87’s  Theorem  10.1  (p.  178),  which  uses  an  “equality”  assumption  ((A10.4),  p. 
174)  to  arrive  at  a  two-sided  large  deviation  result  analogous  to  (9)  above.  Q.E.D. 

We  restate  our  main  theorem: 

Theorem  2:  Under  hypotheses  Jl  through  J12,  Qk  converges  in  probabihty  to  the  set  of  global  minima 

Of  L(0)  . 

Proof:  This  result  follows  from  a  discussion  in  K87.  Theorem  2  of  K87,  (p.177)  describes 

probabilities  involving  expected  times  for  the  SA  algorithm  (system  (1.1)  of  K87)  to  transition 


from  one  K,  to  another.  The  SA  algorithm  he  uses  can  be  written  in  our  notation  as 
Qk+i  =®k-ak[g(%k)+^k\ ,  where  C,k  is  i.i.d.  Gaussian  (injected)  noise.  The  K87  Theorem  2  uses 
the  i.i.d.  Gaussian  assumption  only  to  arrive  at  a  large  deviation  result  exactly  analogous  to  our 
Lemma  3.  The  subsequent  results  in  K87  are  based  on  this  large  deviation  result.  Recall  that 
the  SPSA  algorithm  without  injected  noise  can  be  written  in  the  form  0~/(+l  =e/{  -ak[g(Qk)+^k ] . 
Since  we  have  established  Lemma  3  for  SPSA,  the  results  of  K87  hold  for  the  SPSA  algorithm 
with  its  “effective”  noise  (C)j  replacing  the  gk)  sequence  used  in  K87.  In  particular,  K87’s 
discussion  (pp.  178,  179)  of  his  Theorem  2  is  applicable  to  our  Theorem  2  context  (SPSA 
without  injected  noise),  which  corresponds  to  K87’s  “potential  case.”  Note  that  our 
formulation  corresponds  to  the  K87  setup  where  b(x&)  =  b(x)  in  his  notation,  which,  by  the 
comment  in  K87,  p.  179,  means  that  his  discussion  is  applicable  to  his  system  (1.1)  and  hence 
to  our  setup.  In  his  discussion  on  p.  179,  K87  indicates  that  the  difference  between  the 
measure  of  xn  (which  corresponds  to  our  Qk  )  and  the  invariant  measure  (which  we  have 

denoted  ji11  )  converges  asymptotically  («,£->  °°,r|  o )  to  the  zero  measure  weakly.  This  means 
that,  in  the  limit  as  k  -*  «> ,  Qk  is  equivalent  to  n  in  the  same  sense  as  in  Theorem  2  of  Gelfand 
and  Mitter  (1991),  and  the  desired  convergence  in  probability  follows  as  in  Theorem  1  above. 
Q.E.D. 
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